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Abstract 

This report details an approach to improve the accuracy and precision of free energy 
difference estimates using thermodynamic integration data (slope of the free energy with 
respect to the switching variable A) and its application to calculating solvation free energy. 
The central idea is to utilize polynomial fitting schemes to approximate the thermody- 
namic integration data to improve the accuracy and precision of the free energy differ- 
ence estimates. In this report we introduce polynomial and spline interpolation techniques. 
Two systems with analytically solvable relative free energies are used to test the accu- 
racy and precision of the interpolation approach (Shyu and Ytreberg, J Comput Chem 
30: 2297-2304, 2009). We also use both interpolation and extrapolation methods to de- 
termine a small molecule solvation free energy. Our simulations show that, using such 
polynomial techniques and non-equidistant A values, the solvation free energy can be es- 
timated with high accuracy without using soft-core scaling and separate simulations for 
Lennard- Jones and partial charges. The results from our study suggest these polynomial 
techniques, especially with use of non-equidistant A values, improve the accuracy and pre- 
cision for AF estimates without demanding additional simulations. To allow researchers 
to immediately utilize these methods, free software and documentation is provided via 
: / / www ■ phys . uidaho . edu/ ytreberg/ software 



1 Introduction 

The Helmholtz or Gibbs free energy constitutes an important thermodynamic quantity to un- 
derstand how chemical species recognize each other, associate or react. Examples of such 
problems include conformational equilibria and molecular association, partitioning between im- 
miscible liquids, receptor-drug interaction, protein-protein and protein-DNA association, and 
protein stability [H [2] . Thermodynamic integration (TI) is a commonly used approach for the 
calculation of free energy differences (AF) between two systems with potential energy functions 
Ui and Uq, respectively jSj [U O [H [7] . The free energy difference, AF = Fi — Fq, is equivalent to 



the reversible work to switch from Uq to Ui, and can be determined by estimating the integral 



The notation {dU\/ d\) indicates an ensemble average at a particular value of A. The variable 
A permits continuously switching from one energy function to another. Switching potential 
energies requires a continuously variable energy function U\ such that C/a=o = Uq and U\=i = 
Ui- In addition, the free energy function Ux must be differentiable with respect to A for 



the integral must be approximated by performing simulations at various fixed, discrete, values 
of A. Typically, these discrete A values are used to convert the integral to a sum (e.g., using 
quadrature). If the estimate of (dUx/dX) is slow to converge, then it is necessary to perform 
very long simulations in order reliably estimate the average. In addition, (dUx/dX) ^ may heavily 
depend on A so that a large number of fixed A simulations is needed in order to estimate the 
integral with sufficient accuracy. 

Previously, Shyu and Ytreberg [9] reported the use of regression techniques to calculate the 
free energy from TI data. Using analytical solvable examples, the authors were able to achieve 
highly accurate estimates. We note that regression does not produce a curve that goes through 
every data point exactly. Instead the regression model only describes the tendency of the data 
and does not represent the functional form. 

The purpose of this current study is to introduce polynomial and spline interpolation tech- 
niques to estimate AF and to use both interpolation and regression methods to estimate a small 
molecule solvation free energy. This differs from our previous study in that these polynomials 
interpolate the slope of the free energy dF/dX = (dUx/dX) ^ as a function of A. The derived 
curve is forced to pass through each data point. Regression, by contrast, constructs the curve 
that minimizes the errors between the data and extrapolated points. The key motivation for this 
study is that, even if the averages (dUx/dX) ^ are fully converged (i.e., infinitely long sampling), 
there will be error in the AF estimates due to the fact that one must estimate the integral. Here 
we present methods which reduce this error and construct the polynomial that represents the 
functional form of TI data. We also examine the use of both equidistant and non-equidistant A 
values. Two test systems previously used by Shyu and Ytreberg [9] with analytical AF values 
were utilized to quantify the accuracy and precision of the interpolation techniques. We also 
used both regression and interpolation techniques to estimate a small molecule solvation free 
energy. The results from our simulations suggest that polynomial fitting, especially with use 
of non-equidistant A values, improves the accuracy and precision for AF estimates without 
demanding additional simulations. In addition, polynomial fitting techniques permit accurate 
estimates of free energy difference from TI without the need for soft-core scaling and separate 
simulations for partial charges and Lennard- Jones potentials. 
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2 Theory 



The primary focus of this study is to present a mathematical framework for the analysis of fixed 
TI simulation data using polynomial and spline interpolation techniques. The objective is then 
to construct a polynomial that best fits the TI data. The functional form of the simulation data 
is represented by a series of data points {A, dF/dX}, thus a polynomial is constructed through 
these data. Interpolation refers to the problem of determining a function that exactly represents 
given data points. Mathematically the polynomial obtained from interpolation is considered to 
be the one that best represents all data points within the given interval (A = [0, 1]) \U)\ 

Polynomial interpolation and regression techniques generally give a similar result, especially 
when high degrees of polynomials are employed to fit the TI data. However, the curve obtained 
from polynomial interpolation passes though each data point whereas the curved generated 
via regression does not. If TI simulations are fully converged (i.e., infinite sampling), then 
polynomial interpolation should give the most accurate and precise estimate. Regression, on 
the other hand, utilizes the least squares method that minimizes the sum of the square of 
the difference between the actual value and the approximated one. Regression does have an 
advantage over interpolation in that the approximation is independent of the number of data 
points. However, we note that regression may not give the best estimates if the TI data include 
several inflection points due to the fact that regression does not attempt to construct a curve 
that passes through every data point. 

2.1 Mathematical Notation 

To simplify the mathematical expressions and to be consistent with the most commonly used 
notation in polynomial and spline interpolation papers, we denote the switching variable A by x, 
and, dF/dX by y. We also denote the actual continuous function dF/dX by / (x). The functional 
form of the simulation data is thus represented by a series of data points {A, dF/dX} = {x, y}, 
and the approximating function p{x) is constructed through these data points. The following 
sections briefly introduce mathematical definitions and properties of both Lagrange and Newton 
polynomials, and the cubic spline function. 

2.2 Lagrange and Newton Polynomial Interpolation 

The classical Weierstrass theorem establishes the mathematical foundation for polynomial ap- 
proximation [T2]. The theorem asserts that there exists a polynomial p{x) for approximating 
the continuous function / (x) defined within a closed interval [o, b] = [0, 1], and the polynomial 
approximation p (x) can get arbitrarily close to the function / (x) as the degree of polynomial is 
increased [13]. The most straightforward method to obtain the polynomial p (x) of degree n is to 
calculate the values of / (x) for n + 1 distinct fitting data points within the interval of a < x < 6, 
and satisfy the simultaneous equation yi = f (xj) = p (xj), for z = 0, 1, 2, ... , n. Lagrange and 
Newton polynomials are the most commonly used methods for interpolation \13\ [T5] . 

The coefficients of the above polynomial p (x) then form the basis of the Lagrange inter- 
polation formula. The Weierstrass theorem contends that there is always a unique polynomial 
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p (x) of degree n that satisfies these conditions. The Lagrange interpolation polynomial L (x) 
is a linear combination of Lagrange basis polynomials /j (x) such that 



L{x)=Y.y^kix)=J:y^ n • (2) 

It is worth noting that the degree of polynomial n cannot be chosen arbitrarily but is deter- 
mined by the number of data points n + 1. This is important since the degree of interpolating 
polynomials will thus be determined by the number of A values used to estimate AF. 

Mathematically, the Lagrange form of the interpolation polynomial is generally preferred in 
proof and theoretical arguments because the derivatives of the polynomial are continuous, and 
maxima or minima always exist \10\ lllj. The construction of Lagrange polynomials, however, 
is computationally demanding because all Lagrange basis polynomials have to be reevaluated 
each time is updated. By contrast, Newton interpolation utilizes the divided difference (more 
on this below) which is generally more computationally efficient |14|. 

A more practical form of interpolation polynomial for computational purposes is given 
by Newton polynomials that utilize the divided difference. The divided difference is defined 
as the ratio of the difference in the function values, yi or / (xj), at two points divided by the 
difference in the values of the corresponding independent variable, Xi. Divided differences do not 
require the recalculations of coefficients if new data points are included. As a consequence, it is 
generally more computationally efficient to use the divided differences method for interpolation 
|15| . Similar to the Lagrange polynomial, the Newton polynomial is a linear combination of 
basis polynomials 



N{x) = J2ai 



n i^-xj) 



j=Q 



(3) 



where the coefficients Oj = g [xq, . . . , Xi] is the notation for divided differences. The first divided 
difference, for example, between data points xq and xi is given by 

f{xi)-f{xo) yi-yo 



0-1= g [xo,xi 

Xi — Xq Xi — Xq 

The second divided difference between data points xq, xi, and X2 is given by 

g[xi,x2\ - g[xo,xi] 

X2- Xq 

Accordingly the n-th divided difference between data points xo,xi, . . . ,Xn is given by 

g[xi,X2, ...,Xn]- g[xo, Xi, . . . , Xn^l] 



0-2=9 [xo,Xi,X2] 



9 [XQ.Xi, 



1 Xr, 



Xo 



(4) 



(5) 



(6) 



The Newton interpolation polynomial is then given by 



A^(x) = f{xQ)+g[xQ,xi]{x-xo)+g[xQ,xi,X2]{x-XQ){x-xi) + 
... + g[xQ,xi, . . . ,Xi](x - xo)(x - xi) . . . (x - Xi_i). 



(7) 



The calculations of the divided differences form a successive, recurrent relationship between the 
previous two coefficients. Computationally, the divided differences can be written in the form 
of a table that significantly simplifies the algorithmic implementation [T5]. 
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2.3 Spline Function Interpolation 

Studies have suggested that the spHne function is typically preferred over polynomials because 
the interpolation error can be reduced by using low degrees of polynomials |16l \T7\ [T8] . Spline 
interpolation also avoids the Runge's phenomenon which occurs when using high degree poly- 
nomials [E]. Runge's phenomenon is when approximation errors increase dramatically as the 
degree of interpolation polynomials increases. 

Spline functions are piecewise polynomials of the same degree |17l [20] . Unlike the Lagrange 
and Newton interpolation polynomials, the degree of the piecewise polynomials does not de- 
pend on the number of data points. The piecewise polynomials join together at the data points 
{X,dF/dX} called knots and must be continuous [10]. The main advantages of spline inter- 
polation are stability and simplicity [21j. Moreover, the use of polynomials of lower degree 
offers the possibility to avoid the oscillatory behavior that commonly arises from fitting a single 
polynomial exactly to a large number of data points |10] . 

We utilize the cubic spline which is the most commonly used type. A cubic spline function 
is a set of third degree piecewise polynomials that provide a smooth curve passing through 
all data points. Because the segments join with matching derivatives up to order two, the 
curvature of the polynomials changes smoothly along the knots. The fundamental idea behind 
cubic spline interpolation is based on the engineer's tool used to draw smooth curves through 
a number of points. This spline consists of weights attached to a flat surface at the points to 
be connected. A flexible strip is then bent across each of these weights, resulting in a smooth 
curve. The mathematical spline is similar in principle. The points, in this case, are TI data. 
The weights are the coefficients of the cubic polynomials used to interpolate the data [22] . 

Mathematically, the cubic spline function Si (x) is defined as 

Si (x) = ^ {X - Xif + ^ (Xi+l - xf + 

bhi brii 

("if " ^''^') " + " ^'') " ' 

where hi = Xj+i —Xi is the size of the i-th. interval [xj, Xj+i]. The spline function Si (x) provides 
a cubic polynomial on the interval x G The values of Zi are the second derivatives at 

the the endpoints such that Zi = S'l (xj). To compute the values of Zi, it is necessary to solve 
the recurrent equation 

hi-iZi_i + 2 {hi_i + hi) Zi + hiZi+i = 6 ^ ^^+^^ _ ^ ^'^^ ^ , (9) 

for i = to n — 1. 

In the current study, natural spline boundary conditions were implemented to construct the 
interpolation polynomials. There are two boundary condition types which could be utilized. 
The natural cubic spline imposes the boundary conditions ^^o = and Zn = 0, and the clamped 
cubic spline requires zq and Zn for a given function [161 113 • When the natural boundary 
conditions are used, the spline assumes the shape that a long flexible curve would take if it is 
forced to go through all the data points. A natural spline permits the slope at the ends be free 
to equilibrate to the position that minimizes the oscillatory behavior of the curve. A clamped 
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cubic spline, on the other hand, further requires a piecewise cubic function which passes through 
the given set of knots with a given function or value. The derivatives and second derivatives 
of adjacent cubic polynomials must also agree at the interior abscissae. It is worth noting 
that spline interpolation polynomials are piecewise and do not always approximate the dF/dX 
completely. This is particularly evident when the choice of A neglects the inflection points along 
the curve. 



2.4 Algebraic Formulation 

To minimize the numerical errors in the AF estimation process, interpolating polynomials were 
first transformed into their analytical forms, and then all integrals were evaluated analytically. 
The derivation of the analytical integration form for the interpolating polynomials involves the 
expansion of every coefficient in the polynomial equation. The rearrangement of coefficients 
permits analytical integration of the interpolating polynomials. The analytical integral for the 
Lagrange polynomial is expressed as 



i=0 



L{x)dx = ^yi j li{x)dx = ^yi j J| x- -x- 



i=0 



- — —dx. 



(10) 



Similarly, the analytical integral for the Newton polynomial is expressed as 

, i-l 



/ N (x) dx = ^ (^i Y\.^^ ~ 

i=0 j=0 



(11) 



where the coefficient a, is the notation for divided differences; see Equation |3j 

For spline functions, analytical integrations must be performed at each subinterval. Similar 
to that of Lagrange and Newton polynomial, the integration form requires the expansion and 
rearrangement of all coefficients. The analytical integral form of the cubic spline is expressed 
as 



Si (x) dx 



1 

6/i, 
1 

2h. 



{Zi+i - Zi) 



+ x^ 



{ZiXi+l - Zi+iXi) 

hi 



1 

2hi 

+ ^ iVi+i -Vi)-'^ {zi+i - Zi) 



1 / o s\ 1 / \ / \ 

— yziXi_^_^ - Zi+iXi ) + J- [yiXi+i - Vi+iXi) + — [Zi+iXi - ZiXi+i) . 



(12) 



Similar to that of Lagrange and Newton interpolation polynomials, each Xi is replaced by 
Xi and yi by (dUx/dX) ^. Specifically, for Lagrange interpolation 

1 „ 1 





and for Newton interpolation 

1 



n 



A- A,- 



i^i,i=o - 



dX (Lagrange) 



(13) 



, i-l 



AF = J N{X)dX = J2(^i J ^i) dX (Newton) 

i=o ■?=o 



(14) 
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with the divided difference Ui = g [Aq, Ai, . . . , Aj] given by Equation [6j To calculate AF using 
cubic spline, integrals are evaluated at each subinterval [Ai_i, Aj], and the total AF is the sum 
of all integrals 

n 

AF = Y, I Si (A) dX (Cubic Spline) (15) 
i=i -'^^-^ 

where Si are given by Equation |8j 

3 Computational Details 

Two types of free energy calculations were performed in order to test our interpolation ap- 
proaches. The first type utilized two test systems with analytical AF solutions [9]. Simulations 
were performed using the same protocols as reported by Shyu and Ytreberg [9] . These test sys- 
tems provide an unbiased measure of the accuracy and precision of the interpolation techniques 
for estimating AF. 

The second type of simulation was to compute the solvation free energy for a 4-hydroxy- 
2-butanone (BUQ) molecule [26]. We used both regression [9| and interpolation techniques 
and compared them to use of quadrature. Details of the solvation free energy calculation are 
described below. 

3.1 Test Systems 

Procedures for the simulations of the two test systems were previously described by Shyu 
and Ytreberg p]. Briefly, these systems involve two potential functions Uo{^) = and 
Ui (^) = 2 (,^ — 5)^ for system one, and Uq (^) = and Ui (^) = 5 (? — 5)^ for system two with 
the analytical free energy AF = — ^ In j and AF = — ^ In 5 respectively |9] . An equal amount of 
simulation time (1,000,000 Monte Carlo steps) was spent for each of six (A = 0.0, 0.2, 0.4, 0.6, 0.8, 
and 1.0) and eleven (A = 0.0,0.1,0.2, . . . ,0.9, and 1.0) equidistant A values. Identical equilib- 
rium simulations were performed on the corresponding six non-equidistant A values (A = 0.0, 
0.0955, 0.3455, 0.6545, 0.9045, and 1.0) and eleven (A = 0.0, 0.0245, 0.0955, 0.2061, 0.3455, 
0.5, 0.6545, 0.7939, 0.9045, 0.9755, and 1.0). Averages of the slope dF/d\ = {dUx/d\)^ were 
collected for each value of A. For each A value, simulations were given 1,000 steps to equilibrate 
initially, then were allowed to proceed for 1,000,000 Monte Carlo steps. Each trial started with 
an arbitrarily chosen initial position for the particle. Simulations were performed sequentially 
starting at A = 0. Trial moves for Metropolis Monte Carlo [25j were generated by adding a 
uniform random deviate between -0.5 and 0.5 to the current position resulting in a 40 to 50% ac- 
ceptance ratio. A total of 1,000 independent trials were generated from each case. We recorded 
the biases (AFg^act ~ ■^-^estimate) each of 1,000 independent runs and the corresponding 
averages and standard deviations were calculated in order to provide a measure of the accuracy 
and precision of the AF estimates. 
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3.2 BUQ Solvation 



We applied our interpolation technique to compute the solvation free energy for a BUQ molecule 
|26j . We first obtained a reference estimate for the solvation free energy AF^.^^. To obtain 
Ai^j.g£ with high accuracy, we performed separate simulations for changes in the Lennard-Jones 
parameters and the partial charges [27]. Studies have shown that use of separate simulations 
provides a higher accuracy estimate and avoids possible endpoint singularity problems [6] [271 [28]. 
Thus, the first stage to obtaining AF^^^ was to compute the free energy associated with the 
Lennard-Jones potential. This was accomplished by setting all the partial charges of the solute 
to zero. We then "grew" the BUQ molecule in the solvent using soft-core scaling to ensure a 
smooth potential energy curve, even for interpenetrating molecules |28| I29j. Once the neutral 
atoms are fully grown in the solvent, we performed the second stage to compute the free energy 
associated with the partial charges. This was accomplished by increasing the partial charges 
from zero to their final values given by the forcefield. For both stages, we employed trapezoidal 
quadrature to integrate the free energy slope and obtain the free energy for each of the two 
stages. We computed AF^^f using this two-stage approach for both inserting and deleting BUQ. 

The initial coordinates for the BUQ were extracted from the FKBP-ligand complexes (1D7J) 
|26j which were retrieved from the Protein Data Bank. The topologies for BUQ were then 
generated by the PRODRG server [30]. The simulations were performed using the GROMACS 
package 3.3.3 [31J at constant temperature in a periodic cubic box with an edge length of 
approximately 2.4 nm and the default GROMOS-96 43A1 forcefield. This corresponded to 
approximately 500 SPG [32] water molecules. The neighbor list for nonbounded interactions 
was updated every 10 steps. The bond distances and bond angles of water were constrained 
using the LINGS algorithm |33| . To obtain a isothermal-isobaric ensemble at 300 K, a leap-frog 
stochastic dynamics [34| was used to integrate the equations of motion with a 2.0 fs timestep. 
The temperature was maintained using Langevin dynamics with a friction coefficient of 1.0 
amu/ps. The pressure was maintained at 1.0 atom using the Parrinello- Rahman algorithm |35j . 
The coupling time was set to 1.0 ps, and the isothermal compressibility was set to 4.5 x 10~^ 
bar~^. Particle- mesh summation Ewald algorithm \36\ I37j was used for electrostatics with a 
real-space cutoff of 1.0 nm and a Fourier spacing of 0.1 nm. Van der Waals interactions used a 
cutoff with a smoothing function such that the interactions smoothly decayed to zero between 
0.75 nm and 0.9 nm. Dispersion corrections for the energy and pressure were also utilized |38j . 

The BUQ system was first minimized using steepest descent for 500 steps. 100 ps of con- 
stant volume simulation were performed for system equilibration, followed by 100 ps of constant 
pressure simulation. Then 10 ns constant pressure simulations were preformed with eleven 
equidistant A values (A = 0.0, 0.1, 0.2, 0.9, and 1.0). Simulations were performed sequen- 
tially starting at A = and averages of the slope dF/dX = (dUx/dX) ^ were collected for each 
value of A. For the first stage, we computed the free energy for the Lennard-Jones interac- 
tions using a soft-core scaling parameter of a = 0.5. For the second stage, we computed the 
free energy for the partial charge interactions using a = 0.0 [27]. Four additional A values 
(A = 0.65,0.75,0.85,0.95 for deleting BUQ and A = 0.05,0.15,0.25,0.35 for inserting) were 
included to the simulations for Lennard-Jones potentials to further refine the AF soft-core esti- 
mate where the free energy slopes undergo the most change. The total free energy is the sume 
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of the free energies from simulations for the Lennard-Jones and partial charges. We determined 
that AF^^I^ = 31.6 kJ/mol for deleting BUQ and AF™| = -31.5 kJ/mol for inserting. 

4 Results and Discussion 

We now summarize the results from our polynomial fitting methods applied to two test systems, 
and to BUQ solvation. The analytical test systems provide an objective assessment of the 
accuracy and precision of the AF estimates. The BUQ solvation provides a demonstration of 
our approach on a common chemical problem. For the results below, interpolation polynomials 



for each set of TI data were first transformed into the analytical form (Equations 13 14 and 



15) and integration was then performed analytically. Results presented below show that use of 



interpolation with non-equidistant A values results in accurate and precise AF estimates. 
4.1 Test Systems 

Table [T] summarizes the averages and standard deviations of biases for system one and two from 
1,000 independent trials using six and eleven equidistant and non-equidistant A values. Using 
equidistant A values, the estimates obtained from the cubic spline and trapezoidal quadrature 
are both heavily biased. As the number of A values increases, the accuracy and precision 
improve. Using eleven equidistant A values, for example, the Lagrange and Newton polynomials 
give the most accurate estimates of AF but with large uncertainty compared to that of the 
cubic spline and trapezoidal quadrature. The estimates obtained from cubic spline are slightly 
biased but with a low uncertainty. Using non-equidistant A values, the Lagrange and Newton 
polynomials are able to most accurately estimate AF. These results suggest that using non- 
equidistant A values is preferred to equidistant A values. Quadrature results, however, do 
not improve with the use of non-equidistant A values. We also note that, unlike quadrature, 
the steep free energy curvature for system two does not appear to significantly decrease the 
accuracy of AF estimates obtained from the Lagrange and Newton polynomials. These results 
are consistent with our previous regression study [9j. 

Overall, the Lagrange and Newton polynomials provide the most accurate estimates, but 
often with larger uncertainty than that of spline or quadrature. Results from additional cursory 
simulations suggest that the variation of the AF estimates are predominately caused by the 
oscillations from the use of high degree polynomials (data not shown) . Use of non-equidistant A 
values significantly improves the numerical stability and increases the accuracy and precision of 
AF estimates when using polynomial interpolation. The use of non-equidistant A values, how- 
ever, provides only slight improvement of the estimates obtained from the trapezoidal quadra- 
ture compared to that of equidistant. The results also show that the AF estimates obtained 
from spline interpolation are more accurate than that of trapezoidal rule. The estimates from 
quadrature are all heavily biased. The estimates from the cubic spline have similar uncertainty 
but are not as accurate as that of Lagrange and Newton. Thus, we conclude that use of both 
polynomial and spline interpolation techniques with non-equidistant A values provide accurate 
estimates of AF. 
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System One 



Six Equid. Six Non-Equid. Eleven Equid. Eleven Non-Equid. 



Trapezoidal Rule 
Cubic Spline 
Lagrange /Newton 


1.2496 (0.0210) 
0.4686 (0.0198) 
0.1606 (0.0200) 


1.1390 (0.0212) 
-0.1138 (0.0218) 
-0.0208 (0.0212) 


0.3270 (0.0152) 
0.0752 (0.0152) 
0.0034 (0.0377) 


0.3073 (0.0150) 
-0.0047 (0.0152) 
0.0006 (0.0151) 


System Two |9] 




Six Equid. 


Six Non-Equid. 


Eleven Equid. 


Eleven Non-Equid. 


Trapezoidal Rule 
Cubic Spline 
Lagrange /Newton 


-1.8944 (0.0232) 
-0.8228 (0.0212) 
-0.3495 (0.0210) 


-1.4944 (0.0231) 
0.2035 (0.0234) 
0.0467 (0.0227) 


-0.5077 (0.0161) 
-0.1424 (0.0159) 
-0.0053 (0.0335) 


-0.4094 (0.0154) 
0.0093 (0.0156) 
-0.0006 (0.0154) 



Table 1: Averages and standard deviations of the biases (AFg^^g^^^ — AFgg^-^^g^^g) for system 
one and two using six and eleven equidistant and non-equidistant A values. The exact solution 
for system one is AF^^^^^ = — ^In^ and for system two AF^^^^^ = — ^InS as previously 
reported by Shyu and Ytreberg [9]. Standard deviations of the biases are shown in parentheses. 



Del. Equid. Del. Non-Equid. Ins. Equid. Ins. Non-Equid. 

Trapezoidal Rule -40.2 4L5 39^6 9^ 

Cubic Spline -26.8 1.0 27.0 -1.1 

Lagrange/Newton -14.0 -0.4 14.5 0.4 

Regression -14.3 -0.4 14.5 0.3 



Table 2: Free energy biases (AFj.g£ — AF^g^-jj^g^^g) using eleven equidistant and non-equidistant 
A values. The reference free energy estimates (AF^^|^ and AF™^^) were obtained from a two- 
stage simulation using soft-core scaling and are utilized as reference values to evaluate the 
polynomial techniques. The free energy for BUQ deletion is AF^^ = 31.6 kJ/mol and insertion 

AF^^F = -31.5 kJ/mol. 



4.2 BUQ Solvation 

As detailed above, we first performed a two-stage simulation for Lennard-Jones and partial 
charges in order to obtain the reference solvation free energies AF™| and AF^d ^g ^^^^^ ^g. 
termined the free energy estimate with no soft-core scaling utilizing a single stage, i.e., Lennard- 
Jones and partial charges were simultaneously changed. Figure [T] shows the free energy curves 
from the simulations using eleven equidistant (b) and non-equidistant A values (a). The curves 
are all well-behaved except when A is close to either 0.0 or 1.0. As expected, the figures show 
clear evidence of the endpoint singularity when the vanishing or growing atoms become close 
to the other atoms. The statistical uncertainty progressively grows larger as A approaches 0.0 
or 1.0. The rapid changes on the free energy curves at the endpoint introduce significant errors 
into the numerical integration using trapezoid quadrature. It is worth noting that the use of 
non-equidistant A significantly alleviates the endpoint singularity problem, even for quadrature, 
because more data points are used at each end. 

Table [2] summarizes the AF estimates obtained using both interpolation and regression 
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Figure 1: Solvation free energy curves for BUQ with eleven equidistant (a) and non-equidistant 
(b) A values. Both insertion (solid curve) and deletion (dotted curve) were computed. The 
fitting curves shown were constructed using Lagrange interpolating polynomials. Note that the 
endpoint singularity problem is present when the growing and vanishing atoms are too close 
to the other atoms (A = 0.0 or 1.0). The use of non-equidistant A values allows a smoother 
polynomial fit. 



techniques with equidistant (Table [2]A.) and non-equidistant A values (Table [2^). The refer- 
ence free energies AF^^^ and AF™£ were used to estimate the accuracy and precision of these 
polynomial fitting techniques. For equidistant A values, while none of the integration tech- 
niques give an accurate estimate of AF, Lagrange and Newton provide the best results. The 
use of non-equidistant A values, significantly improves the accuracy of all AF estimates. The 
improvement is particularly evident with the estimates obtained using Lagrange and Newton 
polynomials. Cubic spline interpolation delivers slightly biased estimates but is still consider- 
ably more accurate than that of trapezoidal quadrature. 

For BUQ solvation, our study has shown that using the polynomial fitting techniques for 
interpolation and extrapolation and use of non-equidistant A values, AF can be estimated 
within the accuracy of ±0.4 kJ/mol without using soft-core scaling and performing separate 
simulations for Lennard-Jones or partial charges. Steinbrecher et al. [27] showed that the 
soft-core potentials are most promising for alchemical TI simulations and deliver AF estimates 
with accuracies of 0.1 kcal/mol. However, our simulations suggest that similar accuracy can 
be achieved using non-equidistant A values and polynomial fitting techniques without soft-core 
scaling. 



5 Conclusion 

We have implemented polynomial interpolation and regression techniques to estimate a small 
molecule solvation free energy using thermodynamic integration (TI) simulation data. Our 
study utilized two test systems with analytical AF values to objectively quantify the accuracy 
and precision of the interpolation techniques. We also used both interpolation and regression to 
estimate the solvation free energy for BUQ. These polynomial approaches rely on constructing 
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globally optimal polynomials that best fit the TI data, which are then used to estimate AF. 
Additional algebraic calculations are done to permit analytical integration eliminating potential 
errors due to numerical evaluation on the integral. 

Studies have shown that the use of soft-core scaling improves the accuracy of AF estimates 
within ±0.1 kJ/mol for alchemical TI simulations [27J. Our simulations, however, suggest that 
similar accuracy can also be achieved using non-equidistant A and polynomial interpolation or 
regression techniques without soft-core scaling. 

We have shown that polynomial and spline interpolation techniques provide more accurate 
and precise AF estimates than trapezoidal quadrature for the systems studied here. However, 
we caution that polynomial interpolation techniques possess some inherent weakness. The 
most significant is that use of more than eleven A produces interpolation polynomials that 
are numerically unstable. As a consequence, our approach is limited to use of eleven or less 
equidistant or non-equidistant A values or one may judicially select subsets of the A data that 
best represent the underlying functional form of the free energy curve. 

Our results also illustrate the importance of selection of A values if one wishes to use poly- 
nomial interpolation techniques to improve the accuracy of AF estimates. For the systems 
studied here, use of non-equidistant A always produced more accurate AF estimates than use 
of equidistant A. Using twelve or less A values the Lagrange and Newton polynomials always 
produced the highest accuracy. The AF estimates via cubic spline were more accurate than 
quadrature but less accurate than Lagrange and Newton. We note that for data containing 
more twelve A values, cubic spline is expected to produce more accurate results than Lagrange 
and Newton because of superior numerical stability. 

To better facilitate the use of polynomial fitting techniques, we propose several guidelines for 
estimating AF. Researchers who already have very well converged TI data with eleven or less 
A values are encouraged to utilize the Lagrange or Newton polynomial interpolation approach. 
For researchers who already have data from equilibrium simulations with twelve or more A 
values, the spline interpolation technique should be used instead. Researchers that have not yet 
generated data are strongly encouraged to use non-equidistant A values and polynomial fitting 
techniques. Finally, if the TI data is noisy or not very well converged then regression [9] should 
be used instead of interpolation. To allow researchers to immediately utilize these methods, 
free software and documentation is provided via ,http : //www . phys . uidaho . edu/ytreberg/] 
[jof tware. 
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